Belief Propagation for Continuous State Spaces: 
Stochastic Message-Passing with Quantitative Guarantees 

Nima Noorshams 1 Martin J. Wainwright 1 ' 2 

nshams@eecs . berkeley . edu wainwrig@eecs . berkeley . edu 



Department of Statistics 2 and 
Department of Electrical Engineering & Computer Science 1 
University of California Berkeley 

"(N 

q ! December 2012 

(N 

Abstract 

O 

D ' The sum-product or belief propagation (BP) algorithm is a widely used message- 

, passing technique for computing approximate marginals in graphical models. We in- 

troduce a new technique, called stochastic orthogonal series message-passing (SOSMP), 
for computing the BP fixed point in models with continuous random variables. It is 
based on a deterministic approximation of the messages via orthogonal series expansion, 
and a stochastic approximation via Monte Carlo estimates of the integral updates of the 
basis coefficients. We prove that the SOSMP iterates converge to a ^-neighborhood of 
^ , the unique BP fixed point for any tree-structured graph, and for any graphs with cycles 

in which the BP updates satisfy a contractivity condition. In addition, we demonstrate 
how to choose the number of basis coefficients as a function of the desired approximation 
accuracy 5 and smoothness of the compatiblity functions. We illustrate our theory with 
^ , both simulated examples and in application to optical flow estimation. 

o 
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1 Introduction 



Graphical models provide a parsimonious yet flexible framework for describing probabilistic 
dependencies among large numbers of random variables. They have proven useful in a va- 
riety of application domains, including computational biology, computer vision and image 
processing, data compression, and natural language processing, among others. In all of these 
applications, a central computational challenge is the marginalization problem, by which we 
mean the problem of computing marginal distributions over some subset of the variables. 
Naively approached, such marginalization problems become intractable for all but toy prob- 
lems, since they entail performing summation or integration over high-dimensional spaces. 
The sum-product algorithm, also known as belief propagation (BP), is a form of dynamic 
programming that can be used to compute exact marginals much more efficiently for graph- 
ical models without cycles, known as trees. It is an iterative algorithm in which nodes in 
the graph perform a local summation/integration operation, and then relay results to their 
neighbors in the form of messages. Although it is guaranteed to be exact on trees, it is 
also commonly applied to graphs with cycles, in which context it is often known as loopy 
BP. For more details on graphical models and belief propagation, we refer the readers to the 
papers [El EZl El E51 EB] ■ 

In many applications of graphical models, we encounter random variables that take on 
continuous values (as opposed to discrete). For instance, in computer vision, the problem 
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of optical flow calculation is most readily formulated in terms of estimating a vector field 
in K 2 . Other applications involving continuous random variables include tracking problems 
in sensor networks, vehicle localization, image geotagging, and protein folding in compu- 
tational biology. With certain exceptions (such as multivariate Gaussian problems), the 
marginalization problem is very challenging for continuous random variables: in particular, 
the messages correspond to functions, so that they are expensive to compute and transmit, 
in which case belief propagation may be limited to small-scale problems. Motivated by this 
challenge, researchers have proposed different techniques to reduce complexity of BP in dif- 
ferent applications [3J [25l [121 El [221 [H]. For instance, various types of quantization 
schemes [3 [13] have used to reduce the effective state space and consequently the complexity. 
In another line of work, researchers have proposed stochastic methods inspired by particle 
filtering [31 [251 S H] • These techniques are typically based on approximating the messages as 
weighted particles [U [12] , or mixture of Gaussians [25] . Other researchers [23] have proposed 
the use of kernel methods to simultaneously estimate parameters and compute approximate 
marginals in a simultaneous manner. 

In this paper, we present a low-complexity alternative to belief propagation with conti- 
nous variables. Our method, which we refer to as stochastic orthogonal series message-passing 
(SOSMP), is applicable to general graphical models, and is equipped with various theoretical 
guarantees. As suggested by its name, the algorithm is based on combining two ingredients: 
orthogonal series approximation of the messages, and the use of stochastic updates for effi- 
ciency. In this way, the SOSMP updates lead to a randomized algorithm with substantial 
reductions in communication and computational complexity. Our main contributions are to 
analyze the convergence properties of the SOSMP algorithm, and to provide rigorous bounds 
on the overall error as a function of the associated computational complexity. In particular, 
for tree-structured graphs, we estabish almost sure convergence, and provide an explicit in- 
verse polynomial convergence rate (Theorem [T]). For loopy graphical models on which the 
usual BP updates are contractive, we also establish similar convergence rates (Theorem [2]) . 
Our general theory provides quantitative upper bounds on the number of iterations required 
to compute a 5-accurate approximation to the BP message fixed point, as we illustrate in the 
case of kernel-based potential functions (Theorem [3|) . 

The reminder of the paper is organized as follows. We begin in Section [21 with the 
necessary background on the graphical models as well as the belief propagation algorithm. 
Section [3] is devoted to a precise description of SOSMP algorithm. In Section HI we state our 
main theoretical results and develop some of their corollaries. In order to demonstrate the 
algorithm's effectiveness and confirm theoretical predictions, we provide some experimental 
results, on both synthetic and real data, in Section [5j In Section [61 we provide the proofs of 
our main results, with some of the technical aspects deferred to the appendices. 

2 Background 

We begin by providing some background on graphical models and the belief propagation (or 
sum-product) algorithm. 

2.1 Undirected graphical models 

Consider an undirected graph Q = (V,£), consisting of a collection of nodes or vertices 
V = {l,2,...,n}, along with a collection of edges £ C V x V. An edge is an undirected pair 
(u,v), and self-edges are forbidden (meaning that (u,u) ^ £ for all u G V). For each u G V, 
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let X u be a random variable taking values in a space X u . An undirected graphical model, also 
known as a Markov random field, defines a family of joint probability distributions over the 
random vector X = {X u , u G V}, in which each distribution must factorize in terms of local 
potential functions associated with the cliques of the graph. In this paper, we focus on the 
case of pairwise Markov random fields, in which case the factorization is specified in terms of 
functions associated with the nodes and edges of the graph. 

More precisely, we consider probability densities p that are absolutely continuous with 
respect to a given measure fi, typically the Lebesgue measure for the continuous random 
variables considered here. We say that p respects the graph structure if it can be factorized in 
the form 



p(x 1 ,X 2 , . . . ,X n ) OC Y[ i>u{Xu) ] { Ipv 



{ X^i , X7; 



1) 



uev 



Here ip u : X u — > (0, 00) is the node potential function, whereas ip uv : X u x X v — > (0, 00) denotes 
the edge potential function. A factorization of this form ([I]) is also known as pairwise Markov 
random field; see Figured] for a few examples that are widely used in practice. 

In many applications, a central computational challenge is the computation of the marginal 
distribution 



x 



X 



p(x 1 ,x 2 , ...,x n ) 



(n— 1) times 



J [ V{dXy) 

i>ev\{«} 



at each node u G V. Naively approached, this problem suffers from the curse of dimensionality, 
since it requires computing a multi-dimensional integral over an (n — l)-dimensional space. 
For Markov random fields defined on trees (graphs without cycles), part of this exponential 
explosion can be circumvented by the use of the belief propagation or sum-product algorithm, 
to which we turn in the following section. 
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Figure 1. Examples of pairwise Markov random fields, (a) Two-dimensional grid, (b) Markov 
chain model. Potential functions ip u and ip v are associated with nodes u and v respectively, 
whereas potential function ip uv is associated with edge (u,v). 

Before proceeding, let us make a few comments about the relevance of the marginals in 
applied problems. In a typical application, one also makes independent noisy observations y u 
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of each hidden random variable X u . By Bayes' rule, the posterior distribution of X given the 
observations y = (yi, . . . , y n ) then takes the form 

Px\y(xi,---,Xu \yi,---,y n ) oc ] J < • 

«ev (u,v)es 

where we have introduced the convenient shorthand for the modified node-wise potential 
functions ip u (x u ;y u ) := p(y u I %u) iftu{ x u)- Since the observation vector y is fixed and known, 
any computational problem for the posterior distribution (J3j) can be reduced to an equivalent 
problem for a pairwise Markov random field of the form ([I]), using the given definition of the 
modified potential functions. In addition, although our theory allows for distinct state spaces 
X u at each node u E V, throughout the remainder of the paper, we suppress this possible 
dependence so as to simplify exposition. 

2.2 Belief propagation 

The belief propagation algorithm, also known as the sum-product algorithm, is an iterative 
method based on message-passing updates for computing either exact or approximate marginal 
distributions. For trees (graphs without cycles), it is guaranteed to converge after a finite 
number of iterations and yields the exact marginal distributions, whereas for graphs with 
cycles, it yields only approximations to the marginal distributions. Nonetheless, this "loopy" 
form of belief propagation is widely used in practice. Here we provide a very brief treatment 
sufficient for setting up the main results and analysis of this paper, referring the reader to 
various standard sources |14l [27] for further background. 

In order to define the message-passing updates, we require some further notation. For 
each node v E V, let N[v) := {u E V \ (u,v) E £} be its set of neighbors, and we use 
£{v) := {(v — > u) | u E AA(w)} to denote the set of all directed edges emanating from v. We 
use £ := U„ev £(v) to denote the set of all directed edges in the graph. Let M. denote the set 
of all probability densities (with respect to the base measure \x) defined on the space X — that 
is 

M = {m : X ->• [0,oo) | / m(x)fj,(dx) = l}. 

Jx 

The messages passed by the belief propagation algorithm are density functions, taking values 
in the space M.. More precisely, wg assign one message tti v ^. u G J\A. to every directed edge 
(v —> u) E 6 , and we denote the collection of all messages by tti — \tti v —^ u ^ (y -»■ u) E £}. 
Note that the full collection of messages m takes values in the product space 

At an abstract level, the belief propagation algorithm generates a sequence of message 
densities {m*} in the space where t = 0,1,2... is the iteration number. The up- 

date of message m* to message m t+1 can be written in the form m t+l = ^{m 1 ), where 
is a non-linear operator. This global operator is defined by the local up- 
date operator^ one for each directed edge of the graph, such that 



1 It is worth mentioning, and important for the computational efficiency of belief propagation, that m„_> u 
is only a function of the messages m w ^ v for w £ Af(v) \ {u}. Therefore, we have T v ^ u : M dv ~ 1 — > M, where 
d v is the degree of the node v. However, we suppress this local dependence so as to reduce notational clutter. 
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More precisely, in terms of these local updates, the BP algorithm operates as follows. At 
each iteration t = 0, 1, . . ., each node v G V performs the following steps: 

• for each one of its neighbors u G Af(v), it computes m t v ^ u = 7„^ tt (m f ). 

• it transmits message m*l^ to neighbor u G Af(v). 
In more detail, the message update takes the form 

[^«(m*)](0 

V v ' 

m v^u (') 

where k is a normalization constant chosen to enforce the normalization condition 

™£lu( x u) K dx u) = 1- 

By concatenating the local updates (J4|), we obtain a global update operator F : M) £ \ — > Ai^, 
as previously discussed. The goal of belief propagation message-passing is to obtain a fixed 
point, meaning an element m* G Ai^ such that J-{m*) = m* . Under mild conditions, it can 
be shown that there always exists at least one fixed point, and for any tree-structured graph, 
the fixed point is unique. 

Given a fixed point m* , each node u G V computes its marginal approximation t, * G Ai 
by combining the local potential function ip u with a product of all incoming messages as 

)• (5) 

Figure [2] provides a graphical representation of the flow of the information in these local 
updates. For tree-structured (cycle-free) graphs, it is known that BP updates converge to 
the unique fixed point in a finite number of iterations [27] . Moreover, the quantity t*(x u ) is 
equal to the single-node marginal, as previously defined ([2]). For general graphs, uniqueness 
of the fixed point is no longer guaranteed [27]; however, the same message-passing updates 
can be applied, and are known to be extremely effective for computing approximate marginals 
in numerous applications. 

Although the BP algorithm is considerably more efficient than the brute force approach 
to marginalization, the message update equation still involves computing an integral and 
transmitting a real- valued function (message). With certain exceptions (such as multivariate 
Gaussians), these continuous- valued messages do not have finite representations, so that this 
approach is computationally very expensive. Although integrals can be computed by numer- 
ical methods, the BP algorithm requires performing many such integrals at each iteration, 
which becomes very expensive in practice. 

3 Description of the algorithm 

We now turn to the description of the SOSMP algorithm. Before doing so, we begin with some 
background on the main underlying ingredients: orthogonal series expansion, and stochastic 
message updates. 
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Figure 2. Graphical representation of message-passing algorithms, (a) Node v transmits the 
message m v — y u = F v ^, u (rn), derived from equation (|4]), to its neighbor u. (b) Upon receiving 
all the messages, node u updates its marginal estimate according to (0. 



3.1 Orthogonal series expansion 

As described in the previous section, for continuous random variables, each message is a 
density function in the space M C L 2 (A';//). We measure distances in this space using the 
usual I? norm ||/ — g\\\ := f x (f( x ) ~ 9( x )) 2 [J>(dx). A standard way in which to approximate 
functions is via orthogonal series expansion. In particular, let {(f)j} ( jL 1 be an orthonormal 
basis of 1?{X\ /j,), meaning a collection of functions such that 

J 1 when i = j 

(j>i(x)(j>j(x)ii{dx) = < (6) 
U otherwise. 



x 



■ = (<t>i,<t>j) L 2 

Any function / 6 A4 C L 2 (X;fi) then has an expansion of the form / = Y^jL\ a j ( t > j^ wnere 
aj = (f, (j)j)L 2 are the basis expansion coefficients. 

Of course, maintaining the infinite sequence of basis coefficients {aj} ( j2 =1 is also computa- 
tionally intractable, so that any practical algorithm will maintain only a finite number r of 
basis coefficients. For a given r, we let f r oc [ Y7j=i a j ( Pj] + he the approximation based on the 
first r coefficients. (Applying the operator [t] + = max{0,i} amounts to projecting X^f=i a o ( t > o 
onto the space of non-negative functions, and we also normalize to ensure that it is a density 
function.) In using only r coefficients, we incur the approximation error 

(') r 00 

II/, -/Hi < ll£^-/H2 ( = } E 4 ( 7 ) 

j=l j=r+l 

where inequality (i) uses non-expansivity of the projection, and step (ii) follows from Parseval's 
theorem. Consequently, the approximation error will depend both on 

• how many coefficients r that we retain, and 

• the decay rate of the expansion coefficients {aj}^ =1 . 

For future reference, it is worth noting that the local message update (jU) is defined in 
terms of an integral operator of the form 

/(•) ^ / it> ) f(x v ) n{dx v ). (8) 

Jx 
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Consequently, whenever the edge potential function %{j uv has desirable properties — such as 
differentiability and/or higher order smoothness — then the messages also inherit these prop- 
erties. With an appropriate choice of the basis {(f)j}j° =1 , such properties translate into decay 
conditions on the basis coefficients {a,j} ( jL 1 . For instance, for a-times differentiable functions 
expanded into the Fourier basis, the Riemann-Lebesgue lemma guarantees that the coefficients 
cij decay faster than (l/j) 2a . We develop these ideas at greater length in the sequel. 

3.2 Stochastic message updates 

In order to reduce the approximation error (|7|), the number of coefficients r needs to be 
increased (as a function of the ultimate desired error 5). Since increases in r lead to increases 
in computational complexity, we need to develop effective reduced-complexity methods. In 
this section, we describe (at a high-level) how this can be done via a stochastic version of the 
BP message-passing updates. 

We begin by observing that message update @, following some appropriate normalization, 
can be cast as an expectation operation. This equivalence is essential, because it allows us 
to obtain unbiased approximations of the message update using stochastic techniques. In 
particular, let us define the normalized compatibility function 

r w (-, x v ) := ipuv(-,Xv) / where P uv (x v ) := ip v (x v ) / ip uv (x u ,x v ) n(dx u ). (9) 

Puv{Xv) Jx 

By construction, for each x v , we have j x T uv (x u , x v )fj,(dx u ) = 1. 

Lemma 1. Given an input collection of messages m, let Y be a random variable with density 
proportional to 

[P (m)]{y) oc Puviv) ] { (y). (io) 

w<=Af(v)\{u} 

Then the message update equation (j4| can be written as 

[JV+uMKO = E Y [r uv (-,Y)]. (ii) 

Proof. Let us introduce the convenient shorthand M(y) = F[ m w -+ v (y). By defini- 

weAf(v)\{u} 

tion (jl]) of the message update, we have 

lT f v 1M J x (tpnv(-,y)tpu{y)M(y)fM(dy) 



fx fx {4>uv{x,y) 4> u (y)M(y)) n{dy) n{dx) 

Since the integrand is positive, by Fubini's theorem [9], we can exchange the order of integrals 
in the denominator. Doing so and simplifying the expression yields 

Jx J x Wuv{x,y)fi(dx) \ x j3 uv (z)M(z)^(dz) 

S v ' * v ' 

T U v{-,y) \pv->u(m)](y) 
which establishes the claim. □ 

Based on Lemma [H we can obtain a stochastic approximation to the message update by 
drawing k i.i.d. samples Yi from the density (|10p . and then computing y^— i Vuvi'i 

Yi) I k. 

Given the non- negativity and chosen normalization of T uv , note that this estimate belongs to 
M by construction. Moreover, it is an unbiased estimate of the correctly updated message, 
which plays an important role in our analysis. 
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3.3 Precise description of the algorithm 



The SOSMP algorithm involves a combination of the orthogonal series expansion techniques 
and stochastic methods previously described. Any particular version of the algorithm is 
specified by the choice of basis {4>j}JLi and two positive integers: the number of coefficients 
r that are maintained, and the number of samples k used in the stochastic update. Prior to 
running the algorithm, for each directed edge (v — > u), we pre-compute the inner products 

luv,j{x v ) := / Y uv {x u ,x v )(j)j{x u ) n(dx u ), for j = 1, . . . , r. (13) 
Jx 

v ' 

(T««(-, X v ),<pj{-)) L 2 

When ip uv is a symmetric and positive semidefinite kernel function, these inner products 
have an explicit and simple representation in terms of its Mercer eigendecomposition (see 
Section l4.3p . In the general setting, these r inner products can be computed via standard 
numerical integration techniques. Note that this is a fixed (one-time) cost prior to running 
the algorithm. 

SOSMP algorithm for marginalization: 

1. At time t = 0, initialize the message coefficients 

av->u;j = V r for all j = 1, . . . , r, and (v — > u) 6 £. 

2. For iterations t = 0, 1, 2, . . ., and for each directed edge (v — > u) 

(a) Form the projected message approximation 7n i w ^ v (-) = 
for all w € Af(v) \ {u}. 

(b) Draw k i.i.d. samples Yi from the probability density proportional to 

My) II ^^(y), (14) 

w£Af(v)\{u} 

where j3 uv was previously defined in equation Q. 

(c) Use the samples {Y\, . . . , Yfc} from step (b) to compute 

1 k 

b l + Xj : = I E W^) for 3 = 1, 2, . . . , r, (15) 

i=l 

where the function ^ uv -j is defined in equation (fT3|) . 
(a) For step size rf = l/(i+l), update the r-dimensional message coefficient vectors 

< + 4 u = (1 - V f ) aUu + V'bl^u- (16) 

Figure 3: The SOSMP algorithm for continuous state spaces. 
At each iteration t = 0, 1, 2, . . ., the algorithm maintains an r-dimensional vector of basis 
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Ylj = l a «i-S>«;j^i(') ) 



expansion coefficients 

a v^u = ( a i)^u ; ii ■ ■ ■ ■> a v^u;r) e on directed edge (v — > u) G 5. 

This vector should be understood as defining the current message approximation m t v ^ u on 
edge (v — > u) via the expansion 

r 

n4-*»(0 := 534-Mt^0i(O ( 17 ) 
i=i 

We use a* = {a*^^, (v — ► u) G £] to denote the full set of r \£\ coefficients that are maintained 
by the algorithm at iteration t. With this notation, the algorithm consists of a sequence of 
steps, detailed in Figure El that perform the update a 1 h-> a t+1 , and hence implicitly the 
update m t h-> to* +1 . 

As can be seen by inspection of the steps in Figure [3l each iteration requires 0(rk) floating 
point operations per directed edge, which yields a total of 0{rk \£\) operations per iteration. 



4 Theoretical guarantees 

We now turn to a theoretical analysis of the SOSMP algorithm, and guarantees relative to 
the fixed points of the true BP algorithm. For any tree-structured graph, the BP algorithm 
is guaranteed to have a unique message fixed point m* = {to*^ m , {v — > u) G £}. For graphs 
with cycles, uniqueness is no longer guaranteed, which would make it difficult to compare with 
the SOSMP algorithm. Accordingly, in our analysis of the loopy graph, we make a natural 
contractivity assumption, which guarantees uniqueness of the fixed point m* . 

The SOSMP algorithm generates a random sequence {a*}^ , which define message ap- 
proximations {m*}^ via the expansion (|17p . Of interest to us are the following questions: 

• under what conditions do the message iterates approach a neighborhood of the BP fixed 
point to* as t — > +oo? 

• when such convergence takes place, how fast is it? 

In order to address these questions, we separate the error in our analysis into two terms: 
algorithmic error and approximation error. For a given r, let IT denote the projection operator 
onto the span of {<f>i, ■ ■ ■ , <^v}- In detail, given a function / represented in terms of the infinite 
series expansion / = Y^jLi a j4>j, we have 

r 

n r (/) := »,o,, 

3=1 

For each directed edge (y — > u) G £ , define the functional error 

A*_, u := mU u - U r (mU u ) (18) 

between the message approximation at time t, and the BP fixed point projected onto the first 
r basis functions. Moreover, define the approximation error at the BP fixed point as 

Al_ u := mU u -W{mU u ). (19) 
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Since A* belongs to the span of the first r basis functions, the Pythagorean theorem implies 
that the overall error can be decomposed as 

jl^Wulli^ + jl^-^Jli 2 , ( 20 ) 

Estimation error Approximation error 

Note that the approximation error term is independent of the iteration number t, and can 
only be reduced by increasing the number r of coefficients used in the series expansion. Our 
analysis of the estimation error is based on controlling the \£\ -dimensional error vector 

p 2 (A<) := {||A^J|| 2 , (v^u)e£] G (21) 

and in particular showing that it decreases as 0(l/t) up to a lower floor imposed by the 
approximation error. In order to analyze the approximation error, we introduce the \£\- 
dimensional vector of approximation errors 

p 2 {A r ) := {\\AU U \\ 2 L2 , (v^u)e £} G R^. (22) 

By increasing r, we can reduce this approximation error term, but at the same time, we 
increase the computational complexity of each update. In Section FOl we discuss how to choose 
r so as to trade-off the estimation and approximation errors with computational complexity. 



\m, 



4.1 Bounds for tree-structured graphs 

With this set-up, we now turn to bounds for tree-structured graphs. Our analysis of the 
tree-structured case controls the vector of errors p (A*) using a nilpotent matrix N G M^l*! 6 ! 
determined by the tree structure |19j . Recall that a matrix N is nilpotent with order £ if 
N e = and N^ 1 ^ for some I. As illustrated in Figure HI the rows and columns of N are 
indexed by directed edges. For the row indexed by (y — >• u), there can be non-zero entries only 
for edges in the set {(w —>■«), w G AA(v)\{u}}. These directed edges are precisely those that 
pass messages relevant in updating the message from v to u, so that N tracks the propagation 
of message information in the graph. As shown in our previous work (see Lemma 1 in the 
paper [19]), the matrix N with such structure is nilpotent with degree at most the diameter 
of the tree. (In a tree, there is always a unique edge-disjoint path between any pair of nodes; 
the diameter of the tree is the length of the longest of these paths.) 

Moreover, our results on tree-structured graphs impose one condition on the vector of 
approximation errors A r , namely that 

inf rr(r w (x,y)) > 0, and \AU u {x)\ < \ ini : U T {T uv (x, y)) (23) 

for all x G X and all directed edges (v — > u) G £. This condition ensures that the L 2 -norm 
of the approximation error is not too large relative to the compatibility functions. Since 
su Px, y &x |n r (r ut ,(a;,y)) - T uv (x,y)\ -)• and sup^g^ \A T v ^ u (x)\ -> as r -)■ +oo, assuming 
that the compatibility functions are uniformly bounded away from zero, condition (|23p will 
hold once the number of basis expansion coefficients r is sufficiently large. Finally, our bounds 
involve the constants 

Bj := max sup (T uv (-,y), (f>j) L 2. (24) 
With this set-up, we have the following guarantees: 
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mi->2 m 2 -n m 2 _>4 m^ 2 



O 




(a) (b) 

Figure 4. (a) A simple tree with \£\ = 3 edges and hence \£\ = 6 directed edges, (b) Structure 
of nilpotent matrix N £ Rl £ l x l £ l defined by the graph in (a). Rows and columns of the matrix 
are indexed by directed edges (v — > u) £ £; for the row indexed by (v — ► u), there can be 
non-zero entries only for edges in the set {(w — > v), w £ Af(v)\{u}}. 



Theorem 1. Suppose that X is closed and bounded, the node and edge potential functions are 
continuous, and that condition (|23p holds. Then for any tree- structured model, the sequence 
of messages {m*}££. generated by the SOSMP algorithm have the following properties: 

(a) There is a nilpotent matrix N E Rl^N^I such that the error vector /9 2 (A*) converges 
almost surely to the set 

B:={e€R^ | \e\ < N{I - N)- 1 p 2 {A r )}, (25) 
where < denotes elementwise inequality for vectors. 

(b) Furthermore, for all iterations t = 1, 2, . . we have 

E[p 2 (A*)] < (Q (/ ~ logtiVrl (N1 + 16) + N(I-N)-V(A r ). (26) 

j=i 

To clarify the statement in part (a), it guarantees that the difference /o 2 (A*) — ng(p 2 (A*)) 
between the error vector and its projection onto the set B converges almost surely to zero. 
Part (b) provides a quantitative guarantee on how quickly the expected absolute value of this 
difference converges to zero. In particular, apart from logarithmic factors in t, the convergence 
rate guarantees is of the order 0(l/t). 

4.2 Bounds for general graphs 

Our next theorem addresses the case of general graphical models. The behavior of the ordi- 
nary BP algorithm to a graph with cycles — in contrast to the tree-structured case — is more 
complicated. On one hand, for strictly positive potential functions (as considered in this pa- 
per), a version of Brouwer's fixed point theorem can be used to established existence of fixed 
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points [27]. However, in general, there may be multiple fixed points, and convergence is not 
guaranteed. Accordingly, various researchers have studied conditions that are sufficient to 
guarantee uniqueness of fixed points and/or convergence of the ordinary BP algorithm: one 
set of sufficient conditions, for both uniqueness and convergence, involve assuming that the 
BP update operator is a contraction in a suitable norm (e.g., [2"6 | [TT | \T7 \ l2"T]). 

In our analysis of the SOSMP algorithm for a general graph, we impose the following form 
of contractivity: there exists a constant < 7 < 2 such that 



\T v ->u{jn) - F v ^ u {m')\\ L 2 < (l - -) 



i 



1 



\Af(v)\{u}\ ^ 

1 weM(v)\{u} 



m 



Ul— >D II Jji 1 



(27) 



for all directed edges (v — > u) £ £, and feasible messages m, and m'. We say that the ordinary 
BP algorithm is 7-contractive when condition (|27p holds. 



Theorem 2. Suppose that the ordinary BP algorithm is ^-contractive (|27p . and consider 
the sequence of messages {m 1 }^^ generated with step-size rf = \/(^{t + 1)). Then for all 
t = 1,2, . . the error sequence {A*^^}^ is bounded in mean-square as 



E[p 2 (A*)] 



-< 



>Y7j=x B j \ logt 1 .. . r ll2 

1 +- max JA r v _,J 2 L2 



1. (28) 



I 2 J t 7 («->•«) e£ 

where A 7 V _^ U = m* v ^ u — W{m* v ^ u ) is the approximation error on edge (v — > u). 

Theorem [2] guarantees that under the contractivity condition (|27"|) . the SOSMP iterates 
converge to a neighborhood of the BP fixed point. The error offset depends on the approxima- 
tion error term that decays to zero as r is increased. Moreover, disregarding the logarithmic 
factor, the convergence rate is 0(l/t), which is the best possible for a stochastic approximation 
scheme of this type [ISl H] . 



4.3 Explicit rates for kernel classes 

Theorems Q] and [2] are generic results that apply to any choices of the edge potential functions. 
In this section, we pursue a more refined analysis of the number of arithmetic operations that 
are required to compute a 5-uniformly accurate approximation to the BP fixed point m* , where 
5 > is a user-specified tolerance parameter. By a 5-uniformly accurate approximation, we 
mean a collection of messages m such that 

max JEfHm^-Mi - m* v ^ u \\ 2 L2 ] < 5. (29) 
(v— >u)e£ 

In order to obtain such an approximation, we need to specify both the number of coefficients 
r to be retained, and the number of iterations that we should perform. Based on these 
quantities, our goal is to the specify the minimal number of basic arithmetic operations T(5) 
that are sufficient to compute a <5-accurate message appproximation. 

In order to obtain concrete answers, we study this issue in the context of kernel-based 
potential functions. In many applications, the edge potentials ip uv : X x X — > M + are sym- 
metric and positive semidefinite (PSD) functions, frequently referred to as kernel functions!! 

2 In detail, a PSD kernel function has the property that for all natural numbers n and {xi, . . . , x n } C X, 
the n x n kernel matrix with entries ^ U v(xi, Xj) is symmetric and positive semidefinite. 
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Commonly used examples include the Gaussian kernel tp U v(x,y) = exp(— j\\x — 1 1 1 ) , the 
closely related Laplacian kernel, and other types of kernels that enforce smoothness priors. 
Any kernel function defines a positive semidefinite integral operator, namely via equation 0. 
When X is compact and the kernel function is continuous, then Mercer's theorem |20j guar- 
antees that this integral operator has a countable set of eigenfunctions {(pj}JLi that form an 
orthonormal basis of I?(X\ Moreover, the kernel function has the expansion 

oo 

i>uv{x,y) = (f)j(x) (t>j(y), (30) 

j=i 

where Ai > A2 > • • • > are the eigenvalues, all guaranteed to be non-negative. In general, the 
eigenvalues might differ from edge to edge, but we suppress this dependence for simplicity in 
exposition. We study kernels that are trace class, meaning that the eigenvalues are absolutely 
summable (i.e., YlJLi Xj < °°)- 

For a given eigenvalue sequence {Xj}°?L 1 and some tolerance 5 > 0, we define the critical 
dimension r* = r*(5; {Xj}) to be the smallest positive integer r such that 

A r < 5. (31) 

Since Xj — > 0, the existence of r* < +00 is guaranteed for any 5 > 0. 

Theorem 3. In addition to the conditions of Theorem^ suppose that the compatibility func- 
tions are defined by a symmetric PSD trace-class kernel with eigenvalues {Xj}. If we run the 
SOSMP algorithm with r* = r*(5;{Xj}) basis coefficients, then it suffices to perform 

r* 

T(5;{Xj}) = 0(r* (^A, 2 ) (l/<5) Iog(l/J)) (32) 
i=i 

arithmetic operations per edge in order to obtain a 5-accurate message vector m. 

The proof of Theorem [3] is provided in Section 16.31 It is based on showing that the choice f|31|) 
suffices to reduce the approximation error to 0(5), and then bounding the total operation 
complexity required to also reduce the estimation error. 

Theorem [3] can be used to derive explicit estimates of the complexity for various types of 
kernel classes. We begin with the case of kernels in which the eigenvalues decay at an inverse 
polyomial rate: in particular, given some a > 1, we say that they exhibit a-polynomial decay 
if there is a universal constant C such that 

Xj<C/j a for all j = 1,2,.... (33) 

Examples of kernels in this class include Sobolov spline kernels |10| . which are a widely used 
type of smoothness prior. For example, the spline class associated with functions that are 
s-times differentiable satisfies the decay condition f)33[) with a = 2s. 

Corollary 1. In addition to the conditions of Theorem suppose that the compatibility 
functions are symmetric kernels with a-polynomial decay (|33|) . Then it suffices to perform 

T polj (5) = o({l/5) 1 ^log(l/5)) (34) 

operations per edge in order to obtain a 5-accurate message vector m. 
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The proof of this corollary is immediate from Theorem [3) given the assumption Xj < C/j a , 
we see that r* < (C/6)« and YTjLi A j = Substituting into the bound ([32]) yields the 

claim. Corollary [T] confirms a natural intuition — namely, that it should be easier to compute 
an approximate BP fixed point for a graphical model with smooth potential functions. Dis- 
regarding the logarithmic factor (which is of lower-order), the operation complexity T po \ y (5) 
ranges ranges from 0((l/5) 2 ), obtained as a — > 1 + all the way down to 0(l/5\, obtained as 
a — > +00. 

Another class of widely used kernels are those with exponentially decaying eigenvalues: 
in particular, for some a > 0, we say that the kernel has a- exponential decay if there are 
universal constants (C, c) such that 

Xj < C exp(-cT) for all j = 1,2,.... (35) 

Examples of such kernels include the Gaussian kernel, which satisfies the decay condition ([35]) 
with a = 2 (e.g., [24]). 

Corollary 2. In addition to the conditions of Theorem [|J suppose that the compatibility 
functions are symmetric kernels with a-exponential decay (|35p . Then it suffices to perform 

T exp (5) = o([l/S) (logU/S))^). (36) 

operations per edge in order to obtain a uniformly 5-accurate message vector m. 

As with our earlier corollary, the proof of this claim is a straightforward consequence of 
Theorem [3l Corollary [2] demonstrates that kernel classes with exponentially decaying eigen- 
values are not significantly different from parametric function classes, for which a stochastic 
algorithm would have operation complexity 0(1/ 5). Apart from the lower order logarithmic 
terms, the complexity bound (|36p matches this parametric rate. 

5 Experimental Results 

In this section, we describe some experimental results that help to illustrate the theoretical 
predictions of the previous section. 

5.1 Synthetic Data 

We begin by running some experiments for a simple model, in which both the node and 
edge potentials are mixtures of Gaussians. More specifically, we form a graphical model with 
potential functions of the form 

3 

ipu(xu) = ^2 ^ exp {~( x u- Vu;i) 2 1 '(2cr„.j)) , for all u £ V, and (37a) 
i=l 
3 

*Puv(x u , x v ) = TT uv -i exp (-(x v - x n ) 2 /(2(T^ l ,. i )) for all (it, v) G £ , (37b) 
i=l 

where the non- negative mixture weights are normalized (i.e., Yli=i 7r ™;« = Yli=i 7r ";» = -*-)• 
For each vertex and edge and for all i = 1,2, 3, the mixture parameters are chosen randomly 
from uniform distributions over the range c„.j, c^-i e (0,0.5] and [A u; i £ [—3,3]. 
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Figure 5. Plot of normalized error e*/e° vs. the number of iterations t for 10 different sample 
paths on a chain of size n — 100. The dashed lines are sample paths whereas the solid line 
is the mean square error. In this experiment node and edge potentials are mixtures of three 
Gaussians ([57)1 and we implemented SOSMP using the first r — 10 Fourier coefficients with 
k = 5 samples. 



For a chain-structured graph with n = 100 nodes, we first compute the fixed point of 
standard BP, using direct numerical integration to compute the integralsjl so to compute (an 
extremely accurate approximation of) the fixed point m* . We compare this "exact" answer to 
the approximation obtained by running the SOSMP algorithm using the first r = 10 Fourier 
basis coefficients and k = 5 samples. Having run the SOSMP algorithm, we compute the 
average squared error 



at each time t = 1, 2, 

Figure [5] provides plots of the error (|38p versus the number of iterations for 10 different 
trials of the SOSMP algorithm. (Since the algorithm is randomized, each path is slightly 
different.) The plots support our claim of of almost sure convergence, and moreover, the 
straight lines seen in the log-log plots confirm that convergence takes place at a rate inverse 
polynomial in t. 

In the next few simulations, we test the algorithm's behavior with respect to the number 
of expansion coefficients r, and number of samples k. In particular, Figure [6^ a) illustrates the 
expected error, averaged over several sample paths, vs. the number of iterations for different 
number of expansion coefficients r G {2, 3, 5, 10} when k = 5 fixed; whereas Figure[6](b) depicts 
the expected error vs. the number of iterations for different number of samples k G {1,2,5, 10} 
when r = 10 is fixed. As expected, in Figure [6]^a), the error decreases monotonically, with the 
rate of 1/t, till it hits a floor corresponding the offset incurred by the approximation error. 
Moreover, the error floor decreases with the number of expansion coefficients. On the other 
hand, in Figure EJ^b), increasing the number of samples causes a downward shift in the error. 

3 In particular, we approximate the integral update @ with its Riemann sum over the range X — [—5, 5] 
and with 100 samples per unit time. 




r 



(38) 
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(a) (b) 

Figure 6. Normalized mean squared error E[e*/e°] verses the number of iterations for 
a Markov chain with n — 100 nodes, using potential functions specified by the mixture of 
Gaussians model (|3"T)) . (a) Behavior as the number of expansion coefficients is varied over the 
range r £ {2, 3, 5, 10} with k = 5 samples in all cases. As predicted by the theory, the error 
drops monotonically with the number of iterations until it hits a floor. The error floor, which 
corresponds to the approximation error incurred by message expansion truncation, decreases 
as the number of coefficients r in increased, (b) Mean squared error E[e ] verses the number of 
iterations t for different number of samples k £ {1, 2, 5, 10}, in all cases using r — 10 coefficients. 
Increasing the number of samples k results in a downward shift in the error. 



This behavior is also expected since increasing the number of samples reduces the variance of 
the empirical expectation in equation (|15p . 

In our next set of experiments, still on a chain with n = 100 vertices, we test the behavior of 
the SOSMP algorithm on graphs with edge potentials of varying degrees of smoothness. In all 
cases, we use node potentials from the Gaussian mixture ensemble (|37|) previously discussed, 
but form the edge potentials in terms of a family of kernel functions. More specifically, 
consider the basis functions 

<t>j{x) = sin ((2j - l)vr(x + 5)/10) for j = 1,2, 

each defined on the interval [—5,5]. It is straightforward that the family }°5L 1 forms an 
orthonormal basis of L 2 [— 5, 5]. We use this basis to form the edge potential functions 

1000 

i>uv(x,y) = ]T(l/j)>i(20 hiv), (39) 
i=i 

where a > is a parameter to be specified. By construction, each edge potential is a positive 
semidefinite kernel function satisfying the a-polynomial decay condition (|33p . 

Figure [7] illustrate the error curves for two different choices of the smoothness parameter: 
panel (a) shows a = 0.1, whereas panel (b) shows a = 1. For the larger value of a shown 
in panel (b), the messages in the BP algorithm are smoother, so that the SOSMP estimates 
are more accurate with the same number of expansion coefficients. Moreover, similar to what 
we have observed previously, the error decays with the rate of \/t till it hits the error floor. 
Note that this error floor is lower for the smoother kernel (a = 1) compared to the rougher 
case (a = 0.1); note the difference in axis scaling between panels (a) and (b). Moreover, as 
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Figure 7. Plot of the estimation error e*/e° verses the number of iterations t for the cases of 
(a) a = 0.1 and (b) a = 1. The BP messages are smoother when a = 1, and accordingly the 
SOSMP estimates are more accurate with the same number of expansion coefficients. Moreover, 
the error decays with the rate of 1/t till it hits a floor corresponding to the approximation error 
incurred by truncating the message expansion coefficients. 



predicted by our theory, the approximation error decays faster for the smoother kernel, as 
shown by the plots in Figure in which we plot the final error, due purely to approximation 
effects, versus the number of expansion coefficients r. The semilog plot of Figure [S] shows that 
the resulting lines have different slopes, as would be expected. 

5.2 Computer Vision Application 

Moving beyond simulated problems, we conclude by showing the SOSMP algorithm in appli- 
cation to a larger scale problem that arises in computer vision — namely, that of optical flow 
estimation [4j. In this problem, the input data are two successive frames of a video sequence. 
We model each frame as a collection of pixels arrayed over a y/n x y/n grid, and measured 
intensity values at each pixel location of the form j), I' (i, j)}ff = i ■ Our goal is to estimate 
a 2-dimensional motion vector x u = (x u -i,x u -2) that captures the local motion at each pixel 
u = (i, j), i,j = 1, 2, ... , y/n of the image sequence. 

In order to cast this optical flow problem in terms of message-passing on a graph, we adopt 
the model used by Boccignone et al. [5]. We model the local motion X u as a 2-dimensional 
random vector taking values in the space X = [— d, d] x [— d, d], and associate the random 
vector X u with vertex u, in a 2-dimensional grid (see Figure [Ha)). At node u = we use 

the change between the two image frames to specify the node potential 



ipu(x u -i,x u . 2 ) oc exp 



(J(z, j) - + x u; i, j + x u - 2 )) 2 



) 



On each edge (u,v), we introduce the potential function 




which enforces a type of smoothness prior over the image. 
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Figure 8. Final approximation error vs. the number of expansion coefficients for the cases of 
a = 0.1 and a = 1. As predicted by the theory, the error floor decays with a faster pace for 
the smoother edge potential. 



To estimate the motion of a truck, we applied the SOSMP algorithm using the 2-dimensional 
Fourier expansion as our orthonormal basis to two 250 x 250 frames from a truck video se- 
quence (see Figured]). We apply the SOSMP algorithm using the first r = 9 coefficients and 
k = 3 samples. Figure [TO] shows the HSV (hue, saturation, value) codings of the estimated 
motions after t = 1,10,40 iterations, in panels (a), (b) and (c) respectively. (Panel (d) pro- 
vides an illustration of the HSV encoding: hue is used to represent in the angular direction 
of the motion whereas the speed (magnitude of the motion) is encoded by the saturation 
(darker colors meaning higher speeds). The initial estimates of the motion vectors are noisy, 
but it fairly rapidly converges to a reasonable optical flow field. (To be clear, the purpose 
of this experiment is not to show the effectiveness of SOSMP or BP as a particular method 
for optical flow, but rather to demonstrate its correctness and feasibility of the SOSMP in an 
applied setting.) 

6 Proofs 

We now turn to the proofs of our main results. They involve a collection of techniques from 
concentration of measure, stochastic approximation, and functional analysis. 

6.1 Proof of Theorem CO 

Our goal is to bound the error 

r 

W^Jh = W^u-^ r {<^u)\\h = EK + ^- a U-) 2 - ( 4 °) 

where the final equality follows by Parseval's theorem. Here {a*_ j>u . J }^ =1 are the basis expan- 
sion coefficients that define the best r approximation to the BP fixed point m* . The following 
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(a) (b) 

Figure 9. Two frames, each of dimension 250 x 250 pixels, taken from a video sequence of 
moving cars. 



lemma provides an upper bound on this error in terms of two related quantities. First, we 
let {&«_> u .j}p=i denote the basis function expansion coefficients of the J 7 v ^ u (fh t v ^. u ) — that is, 
[Fv-Ht(ffi>ii-tu)](') = YlJLiK-^u-j^ji')- Second, for each j = 1,2, ...,r, define the deviation 
Cvl+Ly := fr^uy ~ K-^u;j-> wnere the coefficients &*t^ u .j are updated in Step 2(c) Figure El 
Lemma 2. For each iteration t = 0, 1, 2, . . we have 

W^uWh < ^7 tt [K-, u ,-aU u ,] 2 + E{EC^} 2 (41) 

1 + 1 j=l r=0 ^ + ^ j=l r=0 

Deterministic term D^~ u Stochastic term S 1 *!^ 

The proof of this lemma is relatively straightforward; see Appendix [A] for the details. Note 
that inequality (1411) provides an upper bound on the error that involves two terms: the first 
term D^l u depends only on the expansion coefficients {bj,_^ u .j, r = 0, . . . ,t} and the BP fixed 
point, and therefore is a deterministic quantity when we condition on all randomness in stages 
up to step t. The second term even when conditioned on randomness through step t, 

remains stochastic, since the coefficients tf^ u (involved in the error term Cv^+u) are updated 
stochastically in moving from iteration t to t + 1. 

We split the remainder of our analysis into three parts: (a) control of the deterministic 
component; (b) control of the stochastic term; and (c) combining the pieces to provide a 
convergence bound. 

6.1.1 Upper-bounding the deterministic term 

By the Pythagorean theorem, we have 

t r t 
EEtU-di] 2 ^ £||^ M (m*)-.F^uK)||£ 2 (42) 

T = j = l T = 
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Figure 10. Color coded images of the estimated motion vectors after (a) t = 1, (b) t = 10, 
(c) t = 40 iterations. Panel (d) illustrates the hsv color coding of the flow. The color hue is 
used to encode the angular dimension of the motion, whereas the saturation level corresponds 
to the speed (length of motion vector). We implemented the SOSMP algorithm by expanding 
in the two-dimensional Fourier basis, using r — 9 coefficients and k — 3 samples. Although the 
initial estimates are noisy, it converges to a reasonable optical flow estimate after around 40 
iterations. 
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In order to control this term, we make use of the following lemma, proved in Appendix iBl 

Lemma 3. For all directed edges (v — > u) G £, there exist constants {L v ^ u - W , w G M{v) \ {u}} 
such that 

v— >u( m ) ~~ J~ v- J >u( m )\\l 2 — ^ 1 Lv— > U ;ui ||'7l t y_ } V — TTl w _^ v \\i l 2 

wEAf(v)\{u} 

for all t = 1, 2, . . .. 

Substituting the result of Lemma [3] in equation (|42p and performing some algebra, we find 
that 

t r t 2 

XT \P~v-^u;i ~ a v^u;j] — ^2 y ^2 L v ^ u - W Wm^j^ — rn w ^ v \\ L 2 S j 
r=0j=l r=0 weN(v)\{u} 

t 

< (d v — 1) ^ ^v-^u;w \\ m w^v — m w-^v\\ 2 L2, (43) 

r=0w£Af(v)\{u} 

where d v is the degree of node v E V. By definition, the message fh^^y is the L 2 -projection 
of m^^y onto Ai. Since m* w ^ v G .A/f and projection is non-expansive, we have 

1 1 — t * ii2 \\ T * l|2 

ll m u>— ~~ m ui— >tillL 2 — ll m «>— >v ~ m w— >v\\L 2 

= ||A^J|| 2 + P^J|| 2 (44) 

where in the second step we have used the Pythagorean identity and recalled the definitions 
of estimation error as well as approximation error from (I18D and (I19p , Substituting the 
inequality (|4"4"]l into the bound ([4*3]) yields 



t r t 
y~^y~! — a v-^u;j] — (dv — 1) ^ ^ Ll^ u . w (||A^„||| 2 + H-AL-Hjllia)- 

r=0j=l r=0uieA^(«)\{u} 

Therefore, introducing the convenient shorthand L v ^. UjW := 2 (c?„ — 1) L^ u . w , we have shown 
that 



1 * 

—J X/ X/ L v-^u,w (I|AL^JIl 2 + ll^-+Jli,a) - (45) 



r=0 w£Af{v)\{u} 



We make further use of this inequality shortly. 
6.1.2 Controlling the stochastic term 

We now turn to the stochastic part of the inequality (|4ip . Our analysis is based on the 
following fact, proved in Appendix [Q 

Lemma 4. For each t > 0, let Q l := o~(m°, . . . ,m t ) be the o-field generated by all messages 
through time t. Then for every fixed j = 1, 2, . . . , r, the sequence Cl^} u -j = ^v^u-j ~ tf,-±w,j ^ s a 
bounded martingale difference with respect to {Q t } ( ^ Q . In particular, we have IC^iVjl — 2-Bj> 
where Bj was previously defined (|2~4"|). 
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Based on Lemma [U standard martingale convergence results [9] guarantee that for each 
j = 1, 2, . . . , r, we have X^t=o QXu-j/ + -0 converges to almost surely (a.s.) as i — )■ oo, and 
hence 

sm. = T^ElEc^f - 2 t{^rEc^} 2 ^ a <46) 

V ; j=l k r=0 J j=l 1 r=0 J 

Furthermore, we can apply the Azuma-Hoeffding inequality [6J in order to characterize the 
rate of convergence. For each j = 1,2, ...,r, define the non-negative random variable 
Zj ■= { Et=o CXl-jf/it + l) 2 . Since < 2B 5 , for any 5 > 0, we have 

for all <5 > 0. Moreover, is non-negative; therefore, integrating its tail bound we can 
compute the expectation 

E[Zj] = J F[Zj >5]d5 < 2 J exp I - { J J ) d8 - 



SB 2 J ' I + I 



and consequently 



32 y r _ B 2 

n\Sl^ u \) < ~ 3 ~\ 3 - (47) 



6.1.3 Establishing convergence 

We now make use of the results established so far to prove the claims. Substituting the upper 
bound (|45p on D t +^ u into the decomposition (|4ip from Lemma [2j we find that 

1 * 

H^ti^Hi 2 — i \\ L v -+ U>w {HA^^H^a + ||^£)-».«||ia } + St^tu- (48) 

t=0 weM(v)\{u} 

For convenience, let us introduce the vector T t+l = {T'j^, (y — >• u) € £ } € with entries 



~Y | L v ^ UiW || A^^^H^a | + Sl~t^ u . (49) 



7^+1 

ioe^(u)\{«} 



Now define a matix iV £ ]j|£l x l£| with entries indexed by the directed edges and set to 

)L V ^ U W if s = v and w £ J\f(v) \ {u} 
N v ^ Uj w ^ s := < . (50) 

I U otherwise. 

In terms of this matrix and the error terms p 2 ( • ) previously defined in equations (|2ip and 
the scalar inequalities (|48p can be written in the matrix form 

t 



P 2 (A' +1 ) < N [j^T[ Ep 2 ( AT )} + N f<A r ) + T * +1 ' ( 51 ) 

T=l 
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where X denotes the element-wise inequality based on the orthant cone. 

From Lemma 1 in the paper |19j . the matrix N is guaranteed to be nilpotent with degree 
£ equal to the graph diameter. Consequently, unwrapping the recursion (|5ip for a total of 
£ = diam(<5) times yields 

p 2 (A t+1 ) * T m + NT[ +l + ... + N^ 1 Tj+l + (N + N 2 + ... + N l ) p 2 {A r ), 

where we define Tq +1 = T t+1 , and then recursively T* +1 := (X)r=l ^T-i)/ for a = 1, 2, ... , 
By the nilpotency of N, we have the identity I + N + . . . + iV^ _1 = (I — N)' 1 ; so we can fur- 
ther simplify the last inequality 

i-i 

p 2 (A m ) < ^N s Tl +1 + N (I - N)- 1 p 2 (A r ) . (52) 

s=0 

Recalling the definition B := {e <E M^l | |e| ^ iV(J - iV)~ 1 / 2 (/4 r ) }, inequality (J52]) implies 
that 

|p 2 (A* +1 ) - n B (/> 2 (A m ))| ^ ^ iV s Ti +1 . (53) 

s=0 

We now use the bound (f53|) to prove both parts of Theorem [TJ 

Proof of Theorem [T](a): To prove the almost sure convergence claim in part (a), it suffices 
to show that for each s = 0, 1, ... ,£ — 1, we have T* as t — > +oo. From equation (|46|) 
we know S 1 *^ 1 ,, — >• almost surely as t — > oo. In addition, the first term in (i49|) is at most 
0(l/t), so that also converges to zero as t — > oo. Therefore, we conclude that Tq — as 
t —> oo. 

In order to extend this argument to higher-order terms, let us recall the following elemen- 
tary fact from real analysis [22J: for any sequence of real numbers {x*}^ such that x t — > 0, 
then we also have Q^r=o a; ' r )A ~~ ^ 0- ^ n order to apply this fact, we observe that Tq 
means that for almost every sample point oj the deterministic sequence {Tq +1 (w)}£2. converges 
to zero. Consequently, the above fact implies that T* +1 (o;) = Q^*=i ^o"( CJ ))/(^ + 1) — ^ as 
i — > oo for almost all sample points ui, which is equivalent to asserting that T\ -^-V 0. Iterating 
the same argument, we establish Tj +1 for all s = 0, 1, ...,£ — 1, thereby concluding the 
proof of Theorem QTa) . 

Proof of Theorem [T](b): Taking the expectation on both sides of the inequality ([52]) yields 

l-l 

E[|p 2 (A m ) -n B (p 2 (A m ))|] < ^iV s E[Ti +1 ]. (54) 

s=0 

so that it suffices to upper bound the expectations E[T* +1 ] for s = 0, 1, . . . ,£ — 1. In Ap- 
pendix H2 we prove the following result: 

Lemma 5. Define the \£ \ -vector v := { ££ =1 B 2 }(2N1 + 32). Then for all s = 0, 1, . . . , £ - 1 
and t = 0, 1, 2, . . 

E[T , +1] ^^(M^)TY (55) 
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Using this lemma, the proof of part (b) follows easily. In particular, substituting the 
bounds (|55j) into equation (|54"1) and doing some algebra yields 



-1 s 



E[\ P 2 (A^)-n B (p*(A^))\) * J2 NS E 



(log(t + 1)) U ( V 



s=0 u=0 

e-i 



u< 



(— 

Vfc+1 



^3^(log(i + l)) s iV s (— 



=<3(J - log (t + 1) iV)- 1 



t + l 



where again we used the fact that = 0. 
6.2 Proof of Theorem H 

Recall the definition of the estimation error A*^^ from (fTHj) . By Parseval's identity we 
know that ||A*_ ) . n ||^2 = X^f=i(°v->-ii*j ~ a *v^u-j) 2 ■ For convenience, we introduce the following 
shorthands for mean squared error on the directed edge (v — > u) 

r 

j=i 



as well as the error 



" max V 



max JE[||A$_J|! a ], 
(v— >u)&£ 



similarly defined for approximation error 



Using the definition of p 2 (A* _>„), some algebra yields 
p 2 (A^J - p\AUJ = E [£ * ^ 



i=i 



* \ — (n* —a* V 

3=1 



E 



r 



From the update equation (fTB"j) . we have 



and hence 



a t+1 - a* 



where 



p 2 (A*+ x J - p 2 (A*^J = C^ + Kf_M*. 

r 

r 
3=1 



(56) 



(57a) 



(57b) 



The following lemma, proved in Appendix [Ej provides upper bounds on these two terms. 
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Lemma 6. For all iterations t = 0, 1, 2, . . ., we have 

r 

UU U < 4(ry*) 2 Y, B l and ( 58a ) 
i=i 

V^ u <rf(l- l)p 2 max {A r ) + r?(l- |)pL(A') - rf{l + |)p 2 (A^J. (58b) 

We continue upper-bounding /j 2 (A*"i^J by substituting the results of Lemma [6] into equa- 
tion (|56p . thereby obtaining 



P 2 (A^J < 4(^) 2 J> 2 + ^(1 - 2) pL x (^) + ^(1 " \) PLA*) + {l - iftl + |)}p 2 (AU) 



< 4(ry*) 2 ^B| + ^(l - I)pL.W + (l - W 

Since this equation holds for all directed edges (v — > u), taking the maximum over the left- 
hand side yields the recursion 

pL x (A t+1 ) < {rff B 2 + rf (1 - |) p 2 ax (^ r ) + (1 " H PL X (A 4 ), (59) 

where we have introduced the shorthand B 2 = 4^^ =1 £> 2 . Setting if = 1/(7 (i + 1)) and 
unwrapping this recursion, we find that 

2ffl log(f + l) 1 2 

which establishes the claim. 
6.3 Proof of Theorem [3] 

As discussed earlier, each iteration of the SOSMP algorithm requires 0(r) operations per 
edge. Consequently it suffices to show that running the algorithm with r = r* coefficients for 
(J2j=i ^f)(V^) l°s(l/^) iterations suffices to achieve mean-squared error less than 5. 

The bound (I28p consists of two terms. In order to characterize the first term (estimation 
error) , we need to bound Bj defined in (|24p . Using the orthonormality of the basis functions 
and the fact that the supremum is attainable over the compact space X, we obtain 

B S = max ^sup ^ ^ u , = 0{\ 3 ). 

Therefore, the estimation error decays at the rate 0((X^=i -^ 2 )(l°g : so tnat t = 0(Q^ =1 A 2 ) (1/(5) log (1/5)) 
iterations are sufficient to reduce it to 0(5). 

The second term (approximation error) in the bound (|28|) depends only on the choice of r, 
and in particular on the r-term approximation error H^l^JI 2 -^ = II ""^-m* ~~ ^(m*^^)!! 2 ^- To 
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bound this term, we begin by representing m* v ^ u in terms of the basis expansion Yl^Li a *i$y 
By the Pythagorean theorem, we have 

oo 

\\mU u -TF{mUu)\\h= E K*)'- ( 60 ) 

j=r+l 

Our first claim is that Y^jLi( a *j) 2 / < 00 • Indeed, since m* is a fixed point of the message 
update equation, we have 

!*_>«(•) °c / ^wM^M^)^,,), 



where M(-) := ^(0 ritueA/Xv) \{u} m tu-H>(")- Exchanging the order of integrations using Fu- 
bini's theorem, we obtain 



a* = (m* v _^ u , (pj) L 2 oc / (cf)j(-), ip uv (-,x v )) L 2 M(x v ) fj,(dx v ). (61) 

Jx 

By the eigenexpansion of ip U v, we have 

00 

{<Pj(-), lpuv(-,X v )) L 2 = ^Xk^j, (j) k ) L 2 4>k{x v ) = Xj(j)j{x v ). 

k=l 

Substituting back into our initial equation ([6T]) . we find that 



a* oc Xj / 4>j{x v ) M(x v ) ^(dx v ) = Xj 

where Oj are the basis expansion coefficients of M. Since the space X is compact, one can see 
that M G L 2 (X), and hence Yl°jL\®j < 00 ■ Therefore, we have 

where we used the fact that X^Li < 00. 

We now use this bound to control the approximation error (|60p . For any r = 1, 2, . . ., we 
have 

E («P 2 = E X J ^ X r E = <w> 

j=r+l j=r+l - 1 j=r+l 

using the non-increasing nature of the sequence {Xj}JL 1 . Consequently, by definition of r* 
(|5Tj) . we have 



\\mU u -W (mU u )\\i 2 = 0{5), 

as claimed. 
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7 Conclusion 



Belief propagation is a widely used message-passing algorithm for computing (approximate) 
marginals in graphical models. In this paper, we have presented and analyzed the SOSMP 
algorithm for running BP in models with continuous variables. It is based on two forms of ap- 
proximation: a deterministic approximation that involves projecting messages onto the span 
of r basis functions, and a stochastic approximation that involves approximating integrals by 
Monte Carlo estimates. These approximations, while leading to an algorithm with substan- 
tially reduced complexity, are also controlled: we provide upper bounds on the convergence 
of the stochastic error, showing that it goes to zero as 0(logt/t) with the number of itera- 
tions, and also control on the deterministic error. For graphs with relatively smooth potential 
functions, as reflected in the decay rate of their basis coefficients, we provided a quantitative 
bound on the total number of basic arithmetic operations required to compute the BP fixed 
point to within 5-accuracy. We illustrated our theoretical predictions using experiments on 
simulated graphical models, as well as in a real-world instance of optical flow estimation. 

Our work leaves open a number of interesting questions. First, although we have focused 
exclusively on models with pairwise interactions, it should be possible to develop forms of 
SOSMP for higher-order factor graphs. Second, the bulk of our analysis was performed under 
a type of contractivity condition, as has been used in past work [26 |llll[T71 l21j on convergence 
of the standard BP updates. However, we suspect that this condition might be somewhat 
relaxed, and doing so would demonstrate applicability of the SOSMP algorithm to a larger 
class of graphical models. 
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A Proof of Lemma [2] 

Subtracting a*^ u .j from both sides of the update (fTBl) in Step 2(c), we obtain 



i t+l 



Setting rf = l/(t + 1) and unwrapping the recursion (i62j) then yields 

t t 

a t+1 -a* ■ = V \b T ■ - a* -1 + V C T+1 

t+1 J v-*u;j "*u-+TiyJ ' ^ _|_ ^ / j >>v->u;T 

T=0 T=0 

Squaring both sides of this equality and using the upper bound (a + b) 2 < 2a 2 + 26 2 , we obtain 

( a v^u;j ~ a v^u;j) — ^ _|_ { ^2 [^v^u;j ~ a v^u;j] } + ^ _|_ Jyz { " 
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Summing over indices j = 1, 2, . . . , r and recalling the expansion (|40p . we find that 
IIAf 



V->JIl2 < X/|^_)_i)2 { E a t)->M;j]} + (t + l) 2 {E^-+u;j} 

— (+ + 1) EE [^->u;j ~~ a f->n;j] + 71 , 1X2 E { E^-+u;.?'} • 
V ; 7 = 1 r=0 V ; 7=1 r=0 



i=l r=0 

V » ' ^ v 

Deterministic term D l +^ u Stochastic term S t v ^ u 

Here step (i) follows from the elementary inequality 



2 

{ E l b v^u-j ~ <->u-j] } < (* + 1) E 

T=0 T=0 



j a 7j->u;jJ 



B Proof of Lemma [3] 

Recall the probability density 

TYlw— ¥v \ ' ) 
weAf(v)\{u} 

defined in Step 2 of the SOSMP algorithm. Using this shorthand notation, the claim of 
LemmaUcan be re-written as [J 7 v ^ ) . u (m)](x) = (T uv (x,-), \p v -> u (m)](-)) jjx . Therefore, apply- 
ing the Cauchy- Schwartz inequality yields 

\[F v ^ u {m)](x) - [F v ^ u (m')](x)\ 2 < \\T uv (x, -)||| 2 \\p v ^ n (m) - p v -> u (m')\\ 2 L2 . 

Integrating both sides of the previous inequality over X and taking square roots yields 

\\F v -^ u {m) - F v ^ u {m')\\ L 2 < C uv \\p v ^ u (m) - p v ^ u (m')\\ L 2, 

1 /2 

where we have denoted the constant C uv := ( J x \T uv {x,y)\ 2 fi(dy) ^(dx)) 

Next step would be to upper bound the term \\p v ^ u (m) — p v ^ u (m')\\ L 2. In order to 
do so, we first show that p v ^ u (jn) is a Frechet differentiable operator on the space J^A! : — 
convhu\l{rn*,® iv _^ u)e gM' v _> u }, where 

M' v ^ u := jm^n rh v -+ u = E Y ^f [U r (T uv (-, Y))] , for some probability density /|, 

denotes the space of all feasible SOSMP messages on the directed edge (v — > u). Doing some 
calculus using the chain rule, we calculate the partial directional (Gateaux) derivative of the 
operator u (ttz) with respect to the function m w ^ v . More specifically, for an arbitrary 
function h w _> v , we have 

W _ ( m)](h ) _ ^ UseM(v)\{uM m s^v 

{M uv , PuvlL? 

/3 U vM uv . t— r . 

\<l"ui— ¥vt Puv II s—>v I L 2 1 



^ seN(v)\{u,w} 
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where M uv = YlweN'MKfv,} m-w^v Clearly the Gateaux derivative is linear and continuous. It 
is also bounded as will be shown now. Massaging the operator norm's definition, we obtain 

sup \V W p v ^ u {m)l2 = sup sup — |i 

m£M' m£M' h w ^ v EM' w ^ v \\^w-^v \\L 2 

v)\{u,w\ m s-^v{x) 

— sup 177 a \ 



n ,3 U vM uv \\ L 2 \\/3 uv n<ieAr(D)\{u,w)} m s^v\\L 2 , 

+ SU P 777 a \2 • \ b6 > 

m£A4' \ lvl uvi Puv] j_2 



Since the space X is compact, the continuous functions f3 uv and tti s —> v achieve their maximum 
over X . Therefore, the numerator of (|63[) is bounded and we only need to show that the 
denominator is bounded away from zero. 

For an arbitrary message m v ^. u G Ai' v ^ u there exist < a < 1 and a bounded probability 
density / so that 

m v -> u (x) = aml^Jx) + (1 - a) E Y ~f[T uv (x,Yj\ 

where we have introduced the shorthand T uv (-,y) := H r (T uv (-,y)). According to Lemma [fl 
we know m*^. u = Ey[r u „(-, Y)], where Y (m*). Therefore, denoting p* = p v 

we have 

m v ^ u (x) > aE Y ~ p *[r uv (x,Y)] + (1 - a) E Y ~f[T uv (x, Y)] 

= ®Y~(ap*+(l-a)f)\Fuv(x,Y)] + a E Yr ^ p * [T uv (x,Y) — T uv (x,Y)]. (64) 

On the other hand, since X is compact, we can exchange the order of expectation and pro- 
jection using Fubini's theorem to obtain 

E Y ^[T uv (;Y)-f uv (-,Y)] = mU u -n r {mUu) = K^u- 
Substituting the last equality into the bound (f64"|) yields 



(x) > inf F uv (x,y) - \A^ u (x)\. 

y&X 

Recalling the assumption (|23j) , one can conclude that the right hand side of the above inequal- 
ity is positive for all directed edges (v — > u). Therefore, the denominator of the expression ([63]) 
is bounded away from zero and more importantly sup mg _ A/( \V W p v ^ u (m)\\2 is attainable. 

Since the derivative is a bounded, linear, and continuous operator, the Gateaux and 
Frechet derivatives coincides and we can use Proposition 2 (Luenberger |16j . page 176) to 
obtain the following upper bound 

\\p v ^ u (m) - p v ^ u (m')\\ L 2 < ^ sup \\V W p v ^ u (m! + a (m - m'))\\ 2 \\m w ^ v - m' w ^ v \\ L 2. 

weJV(v)\{u} 

Setting L v ^ u - W := C uv swp m£M , \\T> W (m)| 2 and putting the pieces together yields 
\\T v -^ u (m) - F v ^ u (m')\\ L 2 < ^ L v ^ 

u;w ll^to— >v HT'vu—tv \\ L 2 5 

weN(v)\{u} 
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for all m, m' G M.' . 

The last step of the proof is to verify that m* G Ai' , and fh G .M' for all t = 1, 2, By 

definition we have m* G A-L On the other hand, unwrapping the update (|16p we obtain 

1 4-1 ~ 
a* ■ = - V 6 r+1 

1 r=0 

= J^Zk^ I T w(x,Yi) 4>j(x) n(dx) 

T=0 i=l ^ X 

Ey^p[r ra (x,y)] <f>j(x) [i(dx), 

where p denotes the empirical probability density. Therefore, m v _^ u = Y^=i a *-s>« j fij * s equal 
to n r (Ey^j5[r ui ,(-, Y)]), thereby completing the proof. 

C Proof of Lemma |4] 

We begin by taking the conditional expectation of u^ u .- : previously defined (fT5|) . given 

the Alteration Q l and with respect to the random samples {Yi,...,Yfc} [p„_>. u (m)](-). 
Exchanging the order of expectation and integral] and exploiting the result of Lemma Q3 we 
obtain 

E \PV^u;j I £*] = f [T^fh^ix) </>j(x) fi(dx) = bUu-j, (65) 

and hence EfC*^^ | = 0, for all j = 1, 2, . . . , r and all directed edges (v — >• it) G £. Also 

it is clear that C^Vj ^ s ^-measurable. Therefore, {Cui«-j}??=o f° rms a martingale difference 
sequence with respect to the filtration {G T }^° =0 . 

On the other hand, recalling the bound (|24p . we have 

k 

\%£u;j\ < T^l^uv^Ye),^)^ < Bj. (66) 

l=\ 

Moreover, exploiting the result of Lemma [T] and exchanging the order of the integration and 
expectation once more yields 

|6Un ;J l = |(E y [r w (-,F)], ^) L2 | = |E y [(r w (-,y), ^> i2 ]| < (67) 

where we have Y ~ [p„_> u (m*)](y). Therefore, the martingale difference sequence is bounded, 
in particular with 

If* 4 " 1 I < \b t+1 I 4- 1 h* I < 2B 



4 Since r utI (a;, i/)0i(a:)[p„_ >ll (m t )](j/) is absolutely integrable, we can exchange the order of the integrals using 
Fubini's theorem. 
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D Proof of Lemma [5] 

We start by uniformly upper-bounding the terms E[|T*j^J]. To do so we first need to bound 
|| A* _+ u \\ L 2. By definition we know HA^J^ = Ej=iK^ u; j - a v-^u;j?^ therefore we only 
need to control the terms <4-wj ano - a *v^u-j for j = 1, 2, . . . , r. 

By construction, we always have |6^t>. -| < Bj for all iterations t = 0,1,.... Also, as- 
suming that | %_ yu .j | < Bj, without loss of generality, a simple induction using the update 
equation (fTU|) shows that | ct* _ >JU _ - 1 < Bj for all t. Moreover, using a similar argument leading 
to (f67|) . we obtain 



a 



v—tu;j I 



{E Y [T UV (;Y)], <t>j) L *\ = \E Y [{r uv (-,Y), <t>j) L *\\ < Bj, 



where we have Y ~ [p v _>. u (rn* )](y). Therefore, putting the pieces together, recalling the 
definition gSJ) of T*^ yields 

< ^ E w £*? + ;fi IX 

meAf(n)\{u} j=i i=i 

Concatenating the previous scalar inequalities yields E[Tq +1 ] ^ + 1), for all t > 0, where 
we have defined the |£]-vector u := { £J =1 S|}(2iVl + 32). 
We now show, using an inductive argument, that 

^.AfMif:, (68) 

L s 1 - t + 1 ^ u\ y ' 

for all s = 0, 1, 2, . . . and i = 0, 1, 2, We have already established the base case s = 0. For 

some s > 0, assume that the claim holds for s — 1. By the definition of we have 



=rn;E E [ r -i] 

T=l 

- t + l It ^ u!t /' 



T=l U=l 

where the inequality follows from the induction hypothesis. We now make note of the ele- 



mentary inequalities Y1t=i l/ r ^ 1 + log*> and 



• (log^r < f(^L dx = to'Y»* foralltl>1 

^ u\t ~ J 1 u\x (u + 1)! 



from which the claim follows. 



E Proof of Lemma [6] 

Upper-bounding the term Uy^. u : By construction, we always have |fe^t^ u .,-| < Bj for 

all iterations t = 0,1,2, Moreover, assuming | c*2— >ix-j I — fy, without loss of generality, a 

simple induction on the update equation shows that |o^_ Kt . J -| < Bj for all iterations t = 0, 1, 

On this basis, we find that 

3=1 3=1 

which establishes the bound (|58aj) . 
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Upper-bounding the term V^ u : It remains to establish the bound (|58b|) on V*_± u . We 
first condition on the cr-field Q t = cr(m°, . . . ,m ( ) and take expectations over the remaining 
randomness, thereby obtaining 



3=1 

r 



3=1 



where {^->u-j}j^=i are the expansion coefficients of the function J^fm') (i.e. b\^ u .^ = 

(Fv^uifn 1 ), 4>j)L 2 ), an d we have recalled the result E[6*t^ -|£7*] = &t->n;j fr° m dHHJ) - By 
Parseval's identity, we have 



T := Y, (bl^u 

3=1 



j ^v—¥u;j) (j^v—yu'jj — }w,j / 



Here we have used the basis expansions 



m 



v-¥u — ' S Y' J a v->u;j < j ) ji an d TL r {m*^. u ) — Y, a v-^u;j l 
3=1 3=1 



Since Il^m*^) = m*_>. w and F v -+ U (m*) = rn* v ^ u , we have 

T = (If (T v ^ u (m t ) - T v ^ u (m*)) , m{,^ u - W (m* v ^ u )) L 2 - \\m^ u - W (m* v ^ u )\\ 2 L 2 
(*) 

< HlF^^m*) - F v ^ u {m*))\\ L 2 \\ml^ u -U r (m* v ^ u )\\ L 2 - \\m l v ^ u - IT {m* v ^ u )\\ 2 L2 
(a) 

< WT^uifh*) - F v ^ u (m*)\\ L 2 \\ml^ u - U r (m* v ^ u )\\ L 2 - ||m*^ u - IT 0*^J||| 2 . 

where step (i) uses the Cauchy-Schwarz inequality, and step (ii) uses the non-expansivity of 
projection. Applying the contraction condition (|27|) . we obtain 



weAf(v)\{u} 



Mv)\-i 

ml^ u - U r (m* v ^ u )\\ 2 L 2 



\mi_+ u -W(m* v ^ u )\\ L i 



- ||m*^ u -n r (m^ u )||2 2 , 

where the second step follows from the elementary inequality ab < a 2 /2 + b 2 /2 and the 
non-expansivity of projection onto the space of non-negative functions. By the Pythagorean 
theorem, we have 



t * 1 1 2 



1,2 — ||m^,^ — n r (m^„)||^ 2 + \\Tl r (m* w ^ v ) — m^^||^ 2 

- II A* II 2 -x- II A r II 2 
" W-^w^fvWL 2 ' W w^fvWL 2 ' 
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Using this equality and taking expectations, we obtain 

EOT < (i - ^^)\w^ : ) + P^Xj + i , 2(AU) | _ p - 2(AU) 

Since V^ u = 2r/*E[T], the claim follows. 
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